Exploiting instruction- and data-level parallelism

Authors

  • Roger Espasa
  • Mateo Valero
Abstract

Historically, computer architects have taken two different approaches to high-performance computing: instruction-level parallelism (ILP) and data-level parallelism (DLP). The ILP paradigm seeks to execute several instructions each cycle. It does this by exploring a sequential instruction stream and extracting independent instructions to send to several execution units in parallel. The DLP paradigm, on the other hand, uses vectorization techniques. A vector instruction specifies a series of operations to be performed on a stream of data. Each operation performed on each individual element is independent of all others, and, therefore, a vector instruction is highly parallel and can be easily pipelined.

In this article, we propose a third approach to high-performance computing that combines the best of ILP and DLP techniques to provide an order of magnitude increase in performance at low complexity.

Figure 1 illustrates three microarchitecture generations in the DLP world. The first vector generation, shown in Figure 1a, introduced in-order, pipelined execution of vector instructions. This generation's prototypical machine is Cray Research's Cray-1. The second DLP generation, in Figure 1b, exploited the parallel semantics of vector instructions to implement multipipe functional units: unit replication that allows processing of more than one pair of operands per cycle. Cray Research's C90 or Nippon Electric Corp.'s SX-3 exemplified the multipipe processor. However, this generation still used the in-order execution model. Useful ILP techniques such as out-of-order execution or register renaming, which fight memory latency and improve processor throughput in the microprocessor world, have never been used in commercial vector computers.

The third DLP generation, depicted in Figure 1c, is the one we are proposing in this article.
This processor merges ILP and DLP in a single-processor architecture that combines three key technologies:

  • vector instructions,
  • out-of-order execution with register renaming, and
  • simultaneous multithreaded execution.

Vectorizable code

For many years, most scientific computing applications have largely followed the DLP model. Much of the vectorizable code optimized for yesterday's vector supercomputers runs on today's superscalar microprocessors. These codes still retain their DLP characteristics. Moreover, in recent years applications containing highly regular DLP code have multiplied. In particular, many DSP and multimedia applications (graphics, compression, encryption) are superbly suited for vector implementation. Vector instruction sets and vector architectures are an excellent match for the characteristics of data-parallel codes. Other architectures such as chip multiprocessors or multiscalar processors [2] are also good candidates to extract high performance from data-parallel code. However, vector instruction sets use fewer processor …
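As a rough illustration of what "vectorizable code" means in the abstract above, the following Python sketch contrasts an element-by-element loop with a strip-mined loop in which each outer iteration stands in for one vector instruction. The names `scalar_add`, `vector_add`, and the register length `VLEN` are assumptions made for this example and do not come from the article; the sketch models only the independence of element operations, not real hardware timing.

```python
# Illustrative sketch only: VLEN and the function names are assumptions,
# not taken from the article.
VLEN = 64  # hypothetical vector register length, in elements


def scalar_add(a, b):
    """One scalar add per element. An ILP machine has to discover at run
    time that these iterations are independent."""
    out = []
    for x, y in zip(a, b):
        out.append(x + y)
    return out


def vector_add(a, b, vlen=VLEN):
    """Strip-mined version: each pass of the outer loop stands in for one
    vector instruction operating on up to `vlen` independent elements."""
    if len(a) != len(b):
        raise ValueError("operands must have the same length")
    out = []
    vector_instructions = 0
    for start in range(0, len(a), vlen):
        chunk_a = a[start:start + vlen]
        chunk_b = b[start:start + vlen]
        # Element operations inside a chunk do not depend on each other,
        # so hardware could pipeline them or spread them over several pipes.
        out.extend(x + y for x, y in zip(chunk_a, chunk_b))
        vector_instructions += 1
    return out, vector_instructions


if __name__ == "__main__":
    a = list(range(200))
    b = [2 * i for i in range(200)]
    result, count = vector_add(a, b)
    print(result == scalar_add(a, b), count)
```

With 200 elements and the assumed length of 64, the strip-mined loop issues four "vector instructions" where the scalar loop issues 200 adds, which is the sense in which a single vector instruction carries many independent operations.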

Similar resources

Exploiting Multi-Grained Parallelism for Multiple-Instruction-Stream Architectures

Exploiting parallelism is an essential part of maximizing the performance of an application on a parallel computer. Parallelism is traditionally exploited at two granularities: individual operations are executed in parallel within a processor to exploit instruction-level parallelism and loop iterations or processes are executed in parallel on different processors to exploit loop-level paralleli...

Performance Study of a Concurrent Multithreaded Processor

The performance of a concurrent multithreaded architectural model, called superthreading [15], is studied in this paper. It tries to integrate optimizing compilation techniques and run-time hardware support to exploit both thread-level and instruction-level parallelism, as opposed to exploiting only instruction-level parallelism in existing superscalars. The superthreaded architecture uses a th...

EE 382C Embedded Software Systems Project Proposal

Objective: The goal of this project is to evaluate the effectiveness of two different techniques for exploiting the Instruction Level Parallelism (ILP) available in Digital Signal Processing (DSP) and Multimedia applications. VLIW (Very Long Instruction Word) architectures have multiple functional units to take advantage of such parallelism, while the SIMD (Single Instruction Multiple Data) a...

The Potential of Exploiting Coarse-Grain Task Parallelism from Sequential Programs

Research into automatic extraction of instruction-level parallelism and data parallelism from sequential languages by compilers has been going on for many years. However, task parallelism has been almost unexploited by parallelizing compilers. It has been shown that coarse-grain task parallelism is a useful additional resource of parallelism for multiprocessors, but the simple and restricted ex...

Journal:
  • IEEE Micro

Volume 17  Issue 

Pages  -

Publication date: 1997